German Credit Score Classification Model Explainability

By: Krishna J

Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn               as sns
import matplotlib.pyplot     as plt
import shap
import eli5
from sklearn.model_selection import train_test_split
#from sklearn.ensemble        import RandomForestClassifier
#from sklearn.linear_model    import LogisticRegression
from sklearn.preprocessing   import MinMaxScaler, StandardScaler
from sklearn.base            import TransformerMixin
from sklearn.pipeline        import Pipeline, FeatureUnion
from typing                  import List, Union, Dict
# Warnings will be used to silence various model warnings for tidier output
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
np.random.seed(0)

Importing the source dataset

Source:

https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Professor Dr. Hans Hofmann, Institut für Statistik und Ökonometrie, Universität Hamburg, FB Wirtschaftswissenschaften, Von-Melle-Park 5, 2000 Hamburg 13

This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer.

In [2]:
feature_list = ['CurrentAcc', 'NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount', 
         'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors', 
         'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan', 'Property', 
         'ExistingCredit', 'Job', 'Dependents', 'Telephone', 'Foreignworker', 'CreditStatus']

german_xai = pd.read_csv('C:/Users/krish/Downloads/german.data.txt',names = feature_list, delimiter=' ')
In [3]:
german_xai.head()
Out[3]:
CurrentAcc NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ... Collateral Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker CreditStatus
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

In [4]:
german_xai.shape
Out[4]:
(1000, 21)

The dataset has 1000 entries with 21 fields.

In [5]:
type(german_xai)
Out[5]:
pandas.core.frame.DataFrame
In [6]:
german_xai.head(10)

german_xai.columns
Out[6]:
CurrentAcc NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ... Collateral Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker CreditStatus
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2
5 A14 36 A32 A46 9055 A65 A73 2 A93 A101 ... A124 35 A143 A153 1 A172 2 A192 A201 1
6 A14 24 A32 A42 2835 A63 A75 3 A93 A101 ... A122 53 A143 A152 1 A173 1 A191 A201 1
7 A12 36 A32 A41 6948 A61 A73 2 A93 A101 ... A123 35 A143 A151 1 A174 1 A192 A201 1
8 A14 12 A32 A43 3059 A64 A74 2 A91 A101 ... A121 61 A143 A152 1 A172 1 A191 A201 1
9 A12 30 A34 A40 5234 A61 A71 4 A94 A101 ... A123 28 A143 A152 2 A174 1 A191 A201 2

10 rows × 21 columns

Out[6]:
Index(['CurrentAcc', 'NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount',
       'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors',
       'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan',
       'Property', 'ExistingCredit', 'Job', 'Dependents', 'Telephone',
       'Foreignworker', 'CreditStatus'],
      dtype='object')

The fields in the source dataset are listed above.

In [7]:
german_xai.dtypes
Out[7]:
CurrentAcc           object
NumMonths             int64
CreditHistory        object
Purpose              object
CreditAmount          int64
Savings              object
EmployDuration       object
PayBackPercent        int64
Gender               object
Debtors              object
ResidenceDuration     int64
Collateral           object
Age                   int64
OtherPayBackPlan     object
Property             object
ExistingCredit        int64
Job                  object
Dependents            int64
Telephone            object
Foreignworker        object
CreditStatus          int64
dtype: object

The datatype of each field is displayed above.

Missing Value Check

In [8]:
import klib
klib.missingval_plot(german_xai)
No missing values found in the dataset.
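The same check can be done with pandas alone, without klib; a minimal sketch on a tiny illustrative frame (the column names here are hypothetical, not from the dataset):

```python
import pandas as pd

# Illustrative frame with one missing value in column 'b'.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', None, 'z']})

# Count missing values per column; a Series of all zeros means no gaps.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

`klib.missingval_plot` presents the same per-column counts graphically, which is why it is used in the cell above.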

Feature Engineering

Dropping unnecessary fields and fields with constant values. In the given data, CurrentAcc stores the status of the existing checking account. We drop this field since it is not related to the credit score being good or bad.

In [9]:
german_xai.drop(['CurrentAcc'],inplace=True,axis=1)
In [10]:
german_xai.head()
Out[10]:
NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ResidenceDuration Collateral Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker CreditStatus
0 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1
1 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2
2 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172 2 A191 A201 1
3 42 A32 A42 7882 A61 A74 2 A93 A103 4 A122 45 A143 A153 1 A173 2 A191 A201 1
4 24 A33 A40 4870 A61 A73 3 A93 A101 4 A124 53 A143 A153 2 A173 2 A191 A201 2

Encoding categorical fields

1. Mapping to actual description

Here we first map the coded domain values of each field to their actual meanings, based on the description provided in the UCI Machine Learning Repository.

Gender field desc:

  • A91 : male : divorced/separated;
  • A92 : female : divorced/separated/married;
  • A93 : male : single;
  • A94 : male : married/widowed;
  • A95 : female : single.

Male is encoded as 1 and female as 0.

Creating new field marital status to study the impact as protected attribute.

In [11]:
german_xai['Gender'].value_counts()
german_xai['Marital_Status']=german_xai['Gender']
#german_xai.replace({'Marital_Status':{'A93':'Single','A91':'divorced/married/widowed','A92':'divorced/married/widowed','A94':'divorced/married/widowed'},'Gender':{'A91':'1','A93':'1','A94':'1','A92':'0'}},inplace=True)
german_xai.replace({'Marital_Status':{'A93':'1','A91':'0','A92':'0','A94':'0'},'Gender':{'A91':'1','A93':'1','A94':'1','A92':'0'}},inplace=True)
german_xai['Gender'].value_counts()
german_xai['Marital_Status'].value_counts()
Out[11]:
A93    548
A92    310
A94     92
A91     50
Name: Gender, dtype: int64
Out[11]:
1    690
0    310
Name: Gender, dtype: int64
Out[11]:
1    548
0    452
Name: Marital_Status, dtype: int64
In [12]:
#german_xai['Age'].value_counts()
german_xai['Age']=german_xai['Age'].apply(lambda x: int(x >= 26))  # np.int is deprecated; use the built-in int
german_xai['Age'].value_counts()
Out[12]:
1    810
0    190
Name: Age, dtype: int64

Entries with age greater than or equal to 26 years are encoded as 1, otherwise 0.

In [13]:
#Encoding target field
german_xai.CreditStatus.value_counts()
german_xai['CreditStatus'].replace({1:1 , 2: 0}, inplace=True)
german_xai.CreditStatus.value_counts()
Out[13]:
1    700
2    300
Name: CreditStatus, dtype: int64
Out[13]:
1    700
0    300
Name: CreditStatus, dtype: int64

The target field CreditStatus is re-encoded as 1 = Good, 0 = Bad; in the original data, 1 = Good and 2 = Bad.

Employment duration desc:

  • A71 : unemployed;
  • A72 : ... < 1 year;
  • A73 : 1 <= ... < 4 years;
  • A74 : 4 <= ... < 7 years;
  • A75 : .. >= 7 years.
In [14]:
german_xai['EmployDuration'].replace({'A71':'unemployed' , 'A72': '<1 yr','A73': '1-4 yr','A74': '4-7 yr', 'A75': '>=7 yr'}, inplace=True)
german_xai.EmployDuration.value_counts()
Out[14]:
1-4 yr        339
>=7 yr        253
4-7 yr        174
<1 yr         172
unemployed     62
Name: EmployDuration, dtype: int64

Credit History desc:

  • A30 : no credits taken/ all credits paid back duly,
  • A31 : all credits at this bank paid back duly,
  • A32 : existing credits paid back duly till now,
  • A33 : delay in paying off in the past,
  • A34 : critical account/ other credits existing (not at this bank).
In [15]:
german_xai['CreditHistory'].replace({'A30':'none/paid' , 'A31': 'none/paid','A32': 'none/paid','A33': 'Delay', 'A34': 'other'}, inplace=True)
german_xai['CreditHistory'].value_counts()
Out[15]:
none/paid    619
other        293
Delay         88
Name: CreditHistory, dtype: int64

Savings Desc:

  • A61 : ... < 100 DM
  • A62 : 100 <= ... < 500 DM
  • A63 : 500 <= ... < 1000 DM
  • A64 : .. >= 1000 DM
  • A65 : unknown/ no savings account
In [16]:
german_xai['Savings'].replace({'A61':'<500' , 'A62': '<500','A63': '500+','A64': '500+', 'A65': 'none'}, inplace=True)
german_xai['Savings'].value_counts()
Out[16]:
<500    706
none    183
500+    111
Name: Savings, dtype: int64

Debtors desc: Other debtors / guarantors

  • A101 : none
  • A102 : co-applicant
  • A103 : guarantor
In [17]:
german_xai['Debtors'].replace({'A101':'none' , 'A102': 'co-applicant','A103': 'guarantor'}, inplace=True)
german_xai['Debtors'].value_counts()
Out[17]:
none            907
guarantor        52
co-applicant     41
Name: Debtors, dtype: int64

Collateral desc:

  • A121 : real estate
  • A122 : if not A121 : building society savings agreement/ life insurance
  • A123 : if not A121/A122 : car or other, not in attribute 6
  • A124 : unknown / no property
In [18]:
german_xai['Collateral'].replace({'A121':'real estate' , 'A122': 'savings/life insurance','A123': 'car/other', 'A124':'unknown/none'}, inplace=True)
german_xai['Collateral'].value_counts()
Out[18]:
car/other                 332
real estate               282
savings/life insurance    232
unknown/none              154
Name: Collateral, dtype: int64

Property: Housing

  • A151 : rent
  • A152 : own
  • A153 : for free
In [19]:
german_xai['Property'].replace({'A151':'rent' , 'A152': 'own','A153': 'free'}, inplace=True)
german_xai['Property'].value_counts()
Out[19]:
own     713
rent    179
free    108
Name: Property, dtype: int64

Telephone desc:

  • A191 : none
  • A192 : yes, registered under the customers name

Foreign worker

  • A201 : yes
  • A202 : no
In [20]:
german_xai['Foreignworker'].replace({'A201':1 , 'A202': 0}, inplace=True)
german_xai['Telephone'].replace({'A191':0 , 'A192': 1}, inplace=True)
german_xai['Telephone'].value_counts()
german_xai['Foreignworker'].value_counts()
Out[20]:
0    596
1    404
Name: Telephone, dtype: int64
Out[20]:
1    963
0     37
Name: Foreignworker, dtype: int64

Purpose desc:

  • A40 : car (new)
  • A41 : car (used)
  • A42 : furniture/equipment
  • A43 : radio/television
  • A44 : domestic appliances
  • A45 : repairs
  • A46 : education
  • A47 : (vacation - does not exist?)
  • A48 : retraining
  • A49 : business
  • A410 : others
In [21]:
german_xai['Purpose'].replace({'A40':'CarNew' , 'A41': 'CarUsed' , 'A42': 'furniture/equip','A43':'radio/tv','A44':'domestic app','A45':'repairs','A46':'education','A47':'vacation','A48':'retraining','A49':'biz','A410':'others'}, inplace=True)
german_xai['Purpose'].value_counts()
Out[21]:
radio/tv           280
CarNew             234
furniture/equip    181
CarUsed            103
biz                 97
education           50
repairs             22
domestic app        12
others              12
retraining           9
Name: Purpose, dtype: int64

Job desc:

  • A171 : unemployed/ unskilled - non-resident
  • A172 : unskilled - resident
  • A173 : skilled employee / official
  • A174 : management/ self-employed/highly qualified employee/ officer
In [22]:
german_xai['Job'].replace({'A171':'unemp/unskilled-non resident' , 'A172': 'unskilled-resident','A173': 'skilled employee','A174':'management/self-emp/officer/highly qualif emp'}, inplace=True)
german_xai['Job'].value_counts()
Out[22]:
skilled employee                                 630
unskilled-resident                               200
management/self-emp/officer/highly qualif emp    148
unemp/unskilled-non resident                      22
Name: Job, dtype: int64

Other installment plans desc

  • A141 : bank
  • A142 : stores
  • A143 : none
In [23]:
german_xai['OtherPayBackPlan'].replace({'A141':'bank' , 'A142': 'stores','A143': 'none'}, inplace=True)
german_xai['OtherPayBackPlan'].value_counts()
Out[23]:
none      814
bank      139
stores     47
Name: OtherPayBackPlan, dtype: int64
In [24]:
german_xai.head()
Out[24]:
NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ResidenceDuration ... Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker CreditStatus Marital_Status
0 6 other radio/tv 1169 none >=7 yr 4 1 none 4 ... 1 none own 2 skilled employee 1 1 1 1 1
1 48 none/paid radio/tv 5951 <500 1-4 yr 2 0 none 2 ... 0 none own 1 skilled employee 1 0 1 0 0
2 12 other education 2096 <500 4-7 yr 2 1 none 3 ... 1 none own 1 unskilled-resident 2 0 1 1 1
3 42 none/paid furniture/equip 7882 <500 4-7 yr 2 1 guarantor 4 ... 1 none free 1 skilled employee 2 0 1 1 1
4 24 Delay CarNew 4870 <500 1-4 yr 3 1 none 4 ... 1 none free 2 skilled employee 2 0 1 0 1

5 rows × 21 columns

In [25]:
german_xai = german_xai.reindex(columns=['NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount', 
         'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors', 
         'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan', 'Property', 
         'ExistingCredit', 'Job', 'Dependents', 'Telephone', 'Foreignworker', 'Marital_Status', 'CreditStatus'])
german_xai.head()
Out[25]:
NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ResidenceDuration ... Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker Marital_Status CreditStatus
0 6 other radio/tv 1169 none >=7 yr 4 1 none 4 ... 1 none own 2 skilled employee 1 1 1 1 1
1 48 none/paid radio/tv 5951 <500 1-4 yr 2 0 none 2 ... 0 none own 1 skilled employee 1 0 1 0 0
2 12 other education 2096 <500 4-7 yr 2 1 none 3 ... 1 none own 1 unskilled-resident 2 0 1 1 1
3 42 none/paid furniture/equip 7882 <500 4-7 yr 2 1 guarantor 4 ... 1 none free 1 skilled employee 2 0 1 1 1
4 24 Delay CarNew 4870 <500 1-4 yr 3 1 none 4 ... 1 none free 2 skilled employee 2 0 1 1 0

5 rows × 21 columns

2. Converting categorical fields to numerical fields

In [26]:
german_xai=pd.get_dummies(german_xai,columns=['CreditHistory','Purpose','Savings','EmployDuration','Debtors','Collateral','OtherPayBackPlan','Property','Job'])
german_xai.head()
Out[26]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker ... OtherPayBackPlan_bank OtherPayBackPlan_none OtherPayBackPlan_stores Property_free Property_own Property_rent Job_management/self-emp/officer/highly qualif emp Job_skilled employee Job_unemp/unskilled-non resident Job_unskilled-resident
0 6 1169 4 1 4 1 2 1 1 1 ... 0 1 0 0 1 0 0 1 0 0
1 48 5951 2 0 2 0 1 1 0 1 ... 0 1 0 0 1 0 0 1 0 0
2 12 2096 2 1 3 1 1 2 0 1 ... 0 1 0 0 1 0 0 0 0 1
3 42 7882 2 1 4 1 1 2 0 1 ... 0 1 0 1 0 0 0 1 0 0
4 24 4870 3 1 4 1 2 2 0 1 ... 0 1 0 1 0 0 0 1 0 0

5 rows × 50 columns

In [27]:
german_xai.columns
Out[27]:
Index(['NumMonths', 'CreditAmount', 'PayBackPercent', 'Gender',
       'ResidenceDuration', 'Age', 'ExistingCredit', 'Dependents', 'Telephone',
       'Foreignworker', 'Marital_Status', 'CreditStatus',
       'CreditHistory_Delay', 'CreditHistory_none/paid', 'CreditHistory_other',
       'Purpose_CarNew', 'Purpose_CarUsed', 'Purpose_biz',
       'Purpose_domestic app', 'Purpose_education', 'Purpose_furniture/equip',
       'Purpose_others', 'Purpose_radio/tv', 'Purpose_repairs',
       'Purpose_retraining', 'Savings_500+', 'Savings_<500', 'Savings_none',
       'EmployDuration_1-4 yr', 'EmployDuration_4-7 yr',
       'EmployDuration_<1 yr', 'EmployDuration_>=7 yr',
       'EmployDuration_unemployed', 'Debtors_co-applicant',
       'Debtors_guarantor', 'Debtors_none', 'Collateral_car/other',
       'Collateral_real estate', 'Collateral_savings/life insurance',
       'Collateral_unknown/none', 'OtherPayBackPlan_bank',
       'OtherPayBackPlan_none', 'OtherPayBackPlan_stores', 'Property_free',
       'Property_own', 'Property_rent',
       'Job_management/self-emp/officer/highly qualif emp',
       'Job_skilled employee', 'Job_unemp/unskilled-non resident',
       'Job_unskilled-resident'],
      dtype='object')
In [28]:
german_xai = german_xai.reindex(columns=['NumMonths', 'CreditAmount', 'PayBackPercent', 'Gender',
       'ResidenceDuration', 'Age', 'ExistingCredit', 'Dependents', 'Telephone',
       'Foreignworker', 'Marital_Status',
       'CreditHistory_Delay', 'CreditHistory_none/paid', 'CreditHistory_other',
       'Purpose_CarNew', 'Purpose_CarUsed', 'Purpose_biz',
       'Purpose_domestic app', 'Purpose_education', 'Purpose_furniture/equip',
       'Purpose_others', 'Purpose_radio/tv', 'Purpose_repairs',
       'Purpose_retraining', 'Savings_500+', 'Savings_<500', 'Savings_none',
       'EmployDuration_1-4 yr', 'EmployDuration_4-7 yr',
       'EmployDuration_<1 yr', 'EmployDuration_>=7 yr',
       'EmployDuration_unemployed', 'Debtors_co-applicant',
       'Debtors_guarantor', 'Debtors_none', 'Collateral_car/other',
       'Collateral_real estate', 'Collateral_savings/life insurance',
       'Collateral_unknown/none', 'OtherPayBackPlan_bank',
       'OtherPayBackPlan_none', 'OtherPayBackPlan_stores', 'Property_free',
       'Property_own', 'Property_rent',
       'Job_management/self-emp/officer/highly qualif emp',
       'Job_skilled employee', 'Job_unemp/unskilled-non resident',
       'Job_unskilled-resident', 'CreditStatus'])
german_xai.head()
Out[28]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker ... OtherPayBackPlan_none OtherPayBackPlan_stores Property_free Property_own Property_rent Job_management/self-emp/officer/highly qualif emp Job_skilled employee Job_unemp/unskilled-non resident Job_unskilled-resident CreditStatus
0 6 1169 4 1 4 1 2 1 1 1 ... 1 0 0 1 0 0 1 0 0 1
1 48 5951 2 0 2 0 1 1 0 1 ... 1 0 0 1 0 0 1 0 0 0
2 12 2096 2 1 3 1 1 2 0 1 ... 1 0 0 1 0 0 0 0 1 1
3 42 7882 2 1 4 1 1 2 0 1 ... 1 0 1 0 0 0 1 0 0 1
4 24 4870 3 1 4 1 2 2 0 1 ... 1 0 1 0 0 0 1 0 0 0

5 rows × 50 columns

Writing data to csv file

In [29]:
german_xai.to_csv('C:/Users/krish/Downloads/German-encoded.csv', index=False)

Splitting into train and test data

In [30]:
X = german_xai.iloc[:, :-1]
y = german_xai['CreditStatus']
X.head()
y.head()
X_train,X_test,y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=40,stratify=y)
Out[30]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker ... OtherPayBackPlan_bank OtherPayBackPlan_none OtherPayBackPlan_stores Property_free Property_own Property_rent Job_management/self-emp/officer/highly qualif emp Job_skilled employee Job_unemp/unskilled-non resident Job_unskilled-resident
0 6 1169 4 1 4 1 2 1 1 1 ... 0 1 0 0 1 0 0 1 0 0
1 48 5951 2 0 2 0 1 1 0 1 ... 0 1 0 0 1 0 0 1 0 0
2 12 2096 2 1 3 1 1 2 0 1 ... 0 1 0 0 1 0 0 0 0 1
3 42 7882 2 1 4 1 1 2 0 1 ... 0 1 0 1 0 0 0 1 0 0
4 24 4870 3 1 4 1 2 2 0 1 ... 0 1 0 1 0 0 0 1 0 0

5 rows × 49 columns

Out[30]:
0    1
1    0
2    1
3    1
4    0
Name: CreditStatus, dtype: int64
In [31]:
german_xai.dtypes
Out[31]:
NumMonths                                             int64
CreditAmount                                          int64
PayBackPercent                                        int64
Gender                                               object
ResidenceDuration                                     int64
Age                                                   int64
ExistingCredit                                        int64
Dependents                                            int64
Telephone                                             int64
Foreignworker                                         int64
Marital_Status                                       object
CreditHistory_Delay                                   uint8
CreditHistory_none/paid                               uint8
CreditHistory_other                                   uint8
Purpose_CarNew                                        uint8
Purpose_CarUsed                                       uint8
Purpose_biz                                           uint8
Purpose_domestic app                                  uint8
Purpose_education                                     uint8
Purpose_furniture/equip                               uint8
Purpose_others                                        uint8
Purpose_radio/tv                                      uint8
Purpose_repairs                                       uint8
Purpose_retraining                                    uint8
Savings_500+                                          uint8
Savings_<500                                          uint8
Savings_none                                          uint8
EmployDuration_1-4 yr                                 uint8
EmployDuration_4-7 yr                                 uint8
EmployDuration_<1 yr                                  uint8
EmployDuration_>=7 yr                                 uint8
EmployDuration_unemployed                             uint8
Debtors_co-applicant                                  uint8
Debtors_guarantor                                     uint8
Debtors_none                                          uint8
Collateral_car/other                                  uint8
Collateral_real estate                                uint8
Collateral_savings/life insurance                     uint8
Collateral_unknown/none                               uint8
OtherPayBackPlan_bank                                 uint8
OtherPayBackPlan_none                                 uint8
OtherPayBackPlan_stores                               uint8
Property_free                                         uint8
Property_own                                          uint8
Property_rent                                         uint8
Job_management/self-emp/officer/highly qualif emp     uint8
Job_skilled employee                                  uint8
Job_unemp/unskilled-non resident                      uint8
Job_unskilled-resident                                uint8
CreditStatus                                          int64
dtype: object
In [32]:
import klib
klib.missingval_plot(X)
klib.missingval_plot(y)
No missing values found in the dataset.
No missing values found in the dataset.

Feature Selection

1. Using Mutual info classif

In [33]:
from sklearn.feature_selection import mutual_info_classif
mutual_info=mutual_info_classif(X_train, y_train,random_state=40)
mutual_info
Out[33]:
array([0.06019707, 0.02108839, 0.00223861, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00353083, 0.02491331,
       0.        , 0.        , 0.00457068, 0.00844984, 0.        ,
       0.        , 0.01332007, 0.        , 0.        , 0.01102085,
       0.02957691, 0.        , 0.        , 0.00160791, 0.03429891,
       0.0457149 , 0.        , 0.        , 0.00819407, 0.00311788,
       0.02706385, 0.        , 0.        , 0.00205918, 0.0261869 ,
       0.        , 0.00280856, 0.00092909, 0.03215574, 0.02567867,
       0.00608715, 0.00376107, 0.        , 0.00802985, 0.        ,
       0.        , 0.        , 0.        , 0.        ])

Estimate mutual information for a discrete target variable.

Mutual information (MI) [1] between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
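This behaviour can be seen on a toy example (a sketch; the data here is synthetic and only illustrates the property, it is not from the credit dataset):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=500)

# Feature 0 is a noisy copy of the target; feature 1 is independent noise.
X = np.column_stack([
    y + 0.1 * rng.randn(500),
    rng.randn(500),
])

mi = mutual_info_classif(X, y, random_state=0)
# The informative feature scores well above the independent one,
# whose estimated MI is at or near zero.
assert mi[0] > mi[1]
```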

In [34]:
X.columns
Out[34]:
Index(['NumMonths', 'CreditAmount', 'PayBackPercent', 'Gender',
       'ResidenceDuration', 'Age', 'ExistingCredit', 'Dependents', 'Telephone',
       'Foreignworker', 'Marital_Status', 'CreditHistory_Delay',
       'CreditHistory_none/paid', 'CreditHistory_other', 'Purpose_CarNew',
       'Purpose_CarUsed', 'Purpose_biz', 'Purpose_domestic app',
       'Purpose_education', 'Purpose_furniture/equip', 'Purpose_others',
       'Purpose_radio/tv', 'Purpose_repairs', 'Purpose_retraining',
       'Savings_500+', 'Savings_<500', 'Savings_none', 'EmployDuration_1-4 yr',
       'EmployDuration_4-7 yr', 'EmployDuration_<1 yr',
       'EmployDuration_>=7 yr', 'EmployDuration_unemployed',
       'Debtors_co-applicant', 'Debtors_guarantor', 'Debtors_none',
       'Collateral_car/other', 'Collateral_real estate',
       'Collateral_savings/life insurance', 'Collateral_unknown/none',
       'OtherPayBackPlan_bank', 'OtherPayBackPlan_none',
       'OtherPayBackPlan_stores', 'Property_free', 'Property_own',
       'Property_rent', 'Job_management/self-emp/officer/highly qualif emp',
       'Job_skilled employee', 'Job_unemp/unskilled-non resident',
       'Job_unskilled-resident'],
      dtype='object')
In [35]:
mutual_info=pd.Series(mutual_info)
mutual_info.index=X_train.columns
mutual_info.sort_values(ascending=False)
Out[35]:
NumMonths                                            0.060197
Savings_<500                                         0.045715
Savings_500+                                         0.034299
Collateral_unknown/none                              0.032156
Purpose_others                                       0.029577
EmployDuration_>=7 yr                                0.027064
Debtors_none                                         0.026187
OtherPayBackPlan_bank                                0.025679
Foreignworker                                        0.024913
CreditAmount                                         0.021088
Purpose_biz                                          0.013320
Purpose_furniture/equip                              0.011021
CreditHistory_other                                  0.008450
EmployDuration_4-7 yr                                0.008194
Property_own                                         0.008030
OtherPayBackPlan_none                                0.006087
CreditHistory_none/paid                              0.004571
OtherPayBackPlan_stores                              0.003761
Telephone                                            0.003531
EmployDuration_<1 yr                                 0.003118
Collateral_real estate                               0.002809
PayBackPercent                                       0.002239
Debtors_guarantor                                    0.002059
Purpose_retraining                                   0.001608
Collateral_savings/life insurance                    0.000929
Property_free                                        0.000000
Property_rent                                        0.000000
Job_management/self-emp/officer/highly qualif emp    0.000000
Collateral_car/other                                 0.000000
Job_skilled employee                                 0.000000
Job_unemp/unskilled-non resident                     0.000000
Gender                                               0.000000
Savings_none                                         0.000000
Debtors_co-applicant                                 0.000000
EmployDuration_unemployed                            0.000000
EmployDuration_1-4 yr                                0.000000
ResidenceDuration                                    0.000000
Purpose_repairs                                      0.000000
Purpose_radio/tv                                     0.000000
Purpose_education                                    0.000000
Purpose_domestic app                                 0.000000
Purpose_CarUsed                                      0.000000
Purpose_CarNew                                       0.000000
CreditHistory_Delay                                  0.000000
Marital_Status                                       0.000000
Dependents                                           0.000000
ExistingCredit                                       0.000000
Age                                                  0.000000
Job_unskilled-resident                               0.000000
dtype: float64
In [36]:
mutual_info.sort_values(ascending=False).plot.bar(figsize=(15,5))
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x246db63ffc8>

Selecting the top 10 features with the highest dependency with respect to the target variable CreditStatus (the protected attributes Gender, Age, and Marital_Status are retained as well).

In [37]:
mutual_info.sort_values(ascending=False)[0:10]
Out[37]:
NumMonths                  0.060197
Savings_<500               0.045715
Savings_500+               0.034299
Collateral_unknown/none    0.032156
Purpose_others             0.029577
EmployDuration_>=7 yr      0.027064
Debtors_none               0.026187
OtherPayBackPlan_bank      0.025679
Foreignworker              0.024913
CreditAmount               0.021088
dtype: float64
In [38]:
german_xai_imp=german_xai[['Gender','Age','Marital_Status','NumMonths','Savings_<500','Savings_500+','Collateral_unknown/none','Purpose_others',
'EmployDuration_>=7 yr','Debtors_none','OtherPayBackPlan_bank','Foreignworker','CreditAmount','CreditStatus']]
german_xai_imp.head()
Out[38]:
Gender Age Marital_Status NumMonths Savings_<500 Savings_500+ Collateral_unknown/none Purpose_others EmployDuration_>=7 yr Debtors_none OtherPayBackPlan_bank Foreignworker CreditAmount CreditStatus
0 1 1 1 6 0 0 0 0 1 1 0 1 1169 1
1 0 0 0 48 1 0 0 0 0 1 0 1 5951 0
2 1 1 1 12 1 0 0 0 0 1 0 1 2096 1
3 1 1 1 42 1 0 0 0 0 0 0 1 7882 1
4 1 1 1 24 1 0 1 0 0 1 0 1 4870 0

2. Using correlation

In [39]:
corrMatrix = round(german_xai_imp.corr(),1)
corrMatrix
Out[39]:
Age NumMonths Savings_<500 Savings_500+ Collateral_unknown/none Purpose_others EmployDuration_>=7 yr Debtors_none OtherPayBackPlan_bank Foreignworker CreditAmount CreditStatus
Age 1.0 0.0 -0.1 0.0 0.1 0.1 0.2 0.0 0.0 -0.1 0.0 0.1
NumMonths 0.0 1.0 -0.0 -0.1 0.2 0.1 0.0 0.0 0.0 0.1 0.6 -0.2
Savings_<500 -0.1 -0.0 1.0 -0.5 -0.0 0.0 -0.1 -0.1 -0.0 0.0 -0.0 -0.2
Savings_500+ 0.0 -0.1 -0.5 1.0 -0.0 -0.0 0.0 0.1 -0.0 0.0 -0.1 0.1
Collateral_unknown/none 0.1 0.2 -0.0 -0.0 1.0 0.1 0.2 0.0 0.1 0.1 0.2 -0.1
Purpose_others 0.1 0.1 0.0 -0.0 0.1 1.0 0.0 -0.1 0.1 -0.0 0.2 -0.0
EmployDuration_>=7 yr 0.2 0.0 -0.1 0.0 0.2 0.0 1.0 0.0 0.1 0.1 -0.0 0.1
Debtors_none 0.0 0.0 -0.1 0.1 0.0 -0.1 0.0 1.0 -0.1 0.1 -0.0 0.0
OtherPayBackPlan_bank 0.0 0.0 -0.0 -0.0 0.1 0.1 0.1 -0.1 1.0 0.0 0.0 -0.1
Foreignworker -0.1 0.1 0.0 0.0 0.1 -0.0 0.1 0.1 0.0 1.0 0.1 -0.1
CreditAmount 0.0 0.6 -0.0 -0.1 0.2 0.2 -0.0 -0.0 0.0 0.1 1.0 -0.2
CreditStatus 0.1 -0.2 -0.2 0.1 -0.1 -0.0 0.1 0.0 -0.1 -0.1 -0.2 1.0
In [40]:
klib.corr_plot(german_xai_imp,target='CreditStatus')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x246db9ec248>

No strong correlation is observed among the input variables, or between the input variables and the target variable.
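Had strongly correlated pairs been present, one column of each pair could have been dropped. A minimal sketch of such a filter, with a hypothetical threshold of 0.9 (the function name and toy frame are illustrative, not part of this analysis):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy frame: 'b' is an exact multiple of 'a', so it is removed.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 0, 1, 0]})
reduced = drop_highly_correlated(toy)
```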

Writing data to csv file

In [41]:
german_xai_imp.to_csv('C:/Users/krish/Downloads/German-reduced.csv', index=False)
In [42]:
#from sklearn.feature_selection import SelectPercentile
#selected_top=SelectPercentile(score_func=mutual_info_classif,  percentile=20)
#from sklearn.feature_selection import SelectKBest
#selected_top=SelectKBest(mutual_info_classif,k=10)
#selected_top.fit_transform(X_train,y_train)
In [43]:
#selected_top.fit_transform(X_train,y_train)
In [44]:
#X_sig=X_train.columns[selected_top.get_support()]
In [45]:
#X_sig
In [46]:
#X_train_sig=pd.DataFrame(X_train,columns=X_sig)
#X_test_sig=pd.DataFrame(X_test,columns=X_sig)
#X_train_sig.head()
#X_train_sig.shape
#X_test_sig.head()
#X_test_sig.shape

List of protected attributes

(https://arxiv.org/pdf/1811.11154.pdf)

In [47]:
from IPython.display  import Image
Image(filename='C:/Users/krish/Desktop/list of protected variables.png',width=500,height=30)
Out[47]:

From the above, we have 3 protected fields in our dataset:

1. Gender
2. Age
3. Marital Status

Now, let us identify the privileged class within each protected attribute.

1. Gender

In [48]:
print(german_xai_imp['Gender'].value_counts())
german_xai_imp.groupby(['Gender'])['CreditStatus'].mean()
#https://arxiv.org/pdf/1810.01943.pdf, https://arxiv.org/pdf/2005.12379.pdf
1    690
0    310
Name: Gender, dtype: int64
Out[48]:
Gender
0    0.648387
1    0.723188
Name: CreditStatus, dtype: float64

Males (1) outnumber females (0), and the male group's mean of the target variable CreditStatus is higher than the female group's. Hence males (1) are the privileged class.

2. Age

In [49]:
print(german_xai_imp['Age'].value_counts())
german_xai_imp.groupby(['Age'])['CreditStatus'].mean()
1    810
0    190
Name: Age, dtype: int64
Out[49]:
Age
0    0.578947
1    0.728395
Name: CreditStatus, dtype: float64

Age is coded as 1 for age > 26 and 0 otherwise. People above 26 are the majority, and their group mean of CreditStatus is higher than that of the under-26 group, so Age = 1 is the privileged group.

3. Marital Status

In [50]:
print(german_xai_imp['Marital_Status'].value_counts())
german_xai_imp.groupby(['Marital_Status'])['CreditStatus'].mean()
1    548
0    452
Name: Marital_Status, dtype: int64
Out[50]:
Marital_Status
0    0.659292
1    0.733577
Name: CreditStatus, dtype: float64

Singles (1) outnumber non-singles (0), and the single group's mean of the target variable CreditStatus is higher than the non-single group's. Hence singles (1) are the privileged group.
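The three group-mean checks above follow the same pattern; assuming the attribute value with the higher mean CreditStatus is the privileged one, they can be condensed into a single loop. A sketch over a toy frame standing in for `german_xai_imp` (`privileged_value` is my own helper):

```python
import pandas as pd

def privileged_value(df, attr, target="CreditStatus"):
    """Return the attribute value whose group mean of the target is highest."""
    return df.groupby(attr)[target].mean().idxmax()

# Toy stand-in with the same 0/1 coding as german_xai_imp
toy = pd.DataFrame({
    "Gender":         [1, 1, 1, 0, 0, 0],
    "Age":            [1, 1, 0, 1, 0, 0],
    "Marital_Status": [1, 1, 1, 0, 0, 0],
    "CreditStatus":   [1, 1, 1, 1, 0, 0],
})
for attr in ["Gender", "Age", "Marital_Status"]:
    print(attr, "-> privileged value:", privileged_value(toy, attr))
```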

Converting the DataFrame to an AIF360-compatible format

BinaryLabelDataset: Base class for all structured datasets with binary labels.

In [51]:
# Fairness metrics
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.explainers import MetricTextExplainer
from aif360.metrics import ClassificationMetric
# Get DF into IBM format
from aif360 import datasets
aif_train_dataset = datasets.BinaryLabelDataset(favorable_label = 1, unfavorable_label = 0, df=german_xai_imp,
                                                      label_names=["CreditStatus"],
                                                     protected_attribute_names=["Age","Gender","Marital_Status"],
                                              privileged_protected_attributes = [1,1,1])
dataset_orig_train, dataset_orig_test = aif_train_dataset.split([0.7], shuffle=True)
In [52]:
dataset_orig_train.feature_names
Out[52]:
['Gender',
 'Age',
 'Marital_Status',
 'NumMonths',
 'Savings_<500',
 'Savings_500+',
 'Collateral_unknown/none',
 'Purpose_others',
 'EmployDuration_>=7 yr',
 'Debtors_none',
 'OtherPayBackPlan_bank',
 'Foreignworker',
 'CreditAmount']

Measuring fairness:

Disparate Impact

a) With respect to Gender, Age, and Marital_Status combined

In [139]:
# Disparate impact measurement with all three protected attributes combined
metric_aif_train_ready_gender = BinaryLabelDatasetMetric(
        aif_train_dataset,
        unprivileged_groups=[{"Age":0,"Gender":0,"Marital_Status":0}],  privileged_groups=[{"Age":1,"Gender":1,"Marital_Status":1}]) 
explainer_aif_train_ready_gender = MetricTextExplainer(metric_aif_train_ready_gender)

print(explainer_aif_train_ready_gender.disparate_impact())
print("Difference in mean outcomes between unprivileged and privileged groups of gender = %f" % metric_aif_train_ready_gender.mean_difference())
Disparate impact (probability of favorable outcome for unprivileged instances / probability of favorable outcome for privileged instances): 0.7385093167701864
Difference in mean outcomes between unprivileged and privileged groups of gender = -0.195587
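Both numbers above come directly from group rates: disparate impact is the ratio of favorable-outcome probabilities, and mean difference is their gap. A minimal sketch on illustrative rates (not the actual group rates from the dataset):

```python
def disparate_impact(p_unpriv: float, p_priv: float) -> float:
    """P(favorable | unprivileged) / P(favorable | privileged)."""
    return p_unpriv / p_priv

def mean_difference(p_unpriv: float, p_priv: float) -> float:
    """P(favorable | unprivileged) - P(favorable | privileged)."""
    return p_unpriv - p_priv

# Illustrative rates: a ratio of 1.0 (difference 0.0) would mean parity
p_u, p_p = 0.60, 0.75
print(round(disparate_impact(p_u, p_p), 3))  # 0.8
print(round(mean_difference(p_u, p_p), 3))   # -0.15
```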

Handling bias: Reweighing

In [55]:
from aif360.algorithms.preprocessing import Reweighing
privileged_groups = [{'Gender': 1}]
unprivileged_groups = [{'Gender': 0}]
RW_gender = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_aif_tranf_gender = RW_gender.fit_transform(dataset_orig_train)
metric_transf_train_gender = BinaryLabelDatasetMetric(dataset_aif_tranf_gender, 
                                               unprivileged_groups=unprivileged_groups,
                                               privileged_groups=privileged_groups)

print("Difference in mean outcomes between unprivileged and privileged groups of gender = %f" % metric_transf_train_gender.mean_difference())
WARNING:root:No module named 'numba.decorators': LFR will be unavailable. To install, run:
pip install 'aif360[LFR]'
Difference in mean outcomes between unprivileged and privileged groups of gender = 0.000000
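The zero mean difference above follows from how Reweighing constructs its instance weights: each (group, label) cell receives weight w = P(group) * P(label) / P(group, label), which makes group membership and label statistically independent under the reweighted distribution. A minimal sketch of the same idea on a toy frame (`reweighing_weights` is my own helper, not the aif360 implementation):

```python
import pandas as pd

def reweighing_weights(df, group_col, label_col):
    """Weight each row by P(group) * P(label) / P(group, label)."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n
    return df.apply(lambda r: p_group[r[group_col]] * p_label[r[label_col]]
                              / p_joint[(r[group_col], r[label_col])], axis=1)

toy = pd.DataFrame({"Gender":       [1, 1, 1, 0, 0, 0],
                    "CreditStatus": [1, 1, 0, 1, 0, 0]})
w = reweighing_weights(toy, "Gender", "CreditStatus")
# After reweighting, both groups have the same weighted favorable rate
for g in (0, 1):
    m = toy["Gender"] == g
    rate = (w[m] * toy.loc[m, "CreditStatus"]).sum() / w[m].sum()
    print(g, round(rate, 3))
```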

b) with respect to Age

In [56]:
# Disparate impact measurement for age
metric_aif_train_ready_age = BinaryLabelDatasetMetric(
        aif_train_dataset,
        unprivileged_groups=[{"Age":0}],
        privileged_groups=[{"Age":1}])
explainer_aif_train_ready_age = MetricTextExplainer(metric_aif_train_ready_age)

print(explainer_aif_train_ready_age.disparate_impact())
print("Difference in mean outcomes between unprivileged and privileged groups of age = %f" % metric_aif_train_ready_age.mean_difference())
Disparate impact (probability of favorable outcome for unprivileged instances / probability of favorable outcome for privileged instances): 0.7948260481712757
Difference in mean outcomes between unprivileged and privileged groups of age = -0.149448

Handling bias: Reweighing

In [57]:
from aif360.algorithms.preprocessing import Reweighing
privileged_groups = [{'Age': 1}]
unprivileged_groups = [{'Age': 0}]
RW_age = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_aif_tranf_age = RW_age.fit_transform(dataset_orig_train)
metric_transf_train_age = BinaryLabelDatasetMetric(dataset_aif_tranf_age, 
                                               unprivileged_groups=unprivileged_groups,
                                               privileged_groups=privileged_groups)

print("Difference in mean outcomes between unprivileged and privileged groups of age = %f" % metric_transf_train_age.mean_difference())
Difference in mean outcomes between unprivileged and privileged groups of age = -0.000000
In [58]:
metric_transf_train_age
Out[58]:
<aif360.metrics.binary_label_dataset_metric.BinaryLabelDatasetMetric at 0x246dcc34b88>

c) with respect to Marital Status

In [59]:
# Disparate impact measurement for marital status
metric_aif_train_ready_marital = BinaryLabelDatasetMetric(
        aif_train_dataset,
        unprivileged_groups=[{"Marital_Status":0}],
        privileged_groups=[{"Marital_Status":1}])
explainer_aif_train_ready_marital = MetricTextExplainer(metric_aif_train_ready_marital)

print(explainer_aif_train_ready_marital.disparate_impact())
print("Difference in mean outcomes between unprivileged and privileged groups of marital status = %f" % metric_aif_train_ready_marital.mean_difference())
Disparate impact (probability of favorable outcome for unprivileged instances / probability of favorable outcome for privileged instances): 0.8987364064632589
Difference in mean outcomes between unprivileged and privileged groups of marital status = -0.074285

Handling bias: Reweighing

In [60]:
from aif360.algorithms.preprocessing import Reweighing
privileged_groups = [{'Marital_Status': 1}]
unprivileged_groups = [{'Marital_Status': 0}]
RW_Marital = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_aif_tranf_marital = RW_Marital.fit_transform(dataset_orig_train)
metric_transf_train_marital = BinaryLabelDatasetMetric(dataset_aif_tranf_marital, 
                                               unprivileged_groups=unprivileged_groups,
                                               privileged_groups=privileged_groups)

print("Difference in mean outcomes between unprivileged and privileged groups of marital status = %f" % metric_transf_train_marital.mean_difference())
Difference in mean outcomes between unprivileged and privileged groups of marital status = -0.000000

Building a ML model

RANDOM FOREST

In [61]:
# Setting the hyperparameters
param_grid = {"max_depth": [3,5,7,None],
              "n_estimators":[3,5,10,15,20],
              "max_features": [4,7,15]}
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
#Creating the classifier
rf_model = RandomForestClassifier(random_state=40)
grid_search = GridSearchCV(rf_model, param_grid=param_grid, cv=5, scoring='recall', verbose=0)
model = grid_search

a) with age as protected variable in the dataset

In [62]:
mdl_age = model.fit(dataset_aif_tranf_age.features, dataset_aif_tranf_age.labels.ravel())
In [63]:
rf_shap_values = shap.KernelExplainer(grid_search.predict,dataset_aif_tranf_age.features)
WARNING:shap:Using 700 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
In [64]:
importances = model.best_estimator_.feature_importances_
indices = np.argsort(importances)
features = dataset_aif_tranf_age.feature_names
#https://stackoverflow.com/questions/48377296/get-feature-importance-from-gridsearchcv
In [65]:
importances

plt.figure(figsize=(20,30))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Out[65]:
array([0.03054944, 0.14122543, 0.04050242, 0.21946099, 0.15563724,
       0.01814715, 0.0132588 , 0.00077236, 0.02369723, 0.        ,
       0.03601172, 0.        , 0.32073723])

Features that are important in the model are given above.

Model Explainability

a) Using SHAP

In [66]:
mdl_age.best_params_
type(model)
explainer = shap.TreeExplainer(grid_search.best_estimator_)
shap_values_a=explainer.shap_values(dataset_aif_tranf_age.features, dataset_aif_tranf_age.labels.ravel())
#https://github.com/slundberg/shap/issues/968
Out[66]:
{'max_depth': 3, 'max_features': 7, 'n_estimators': 10}
Out[66]:
sklearn.model_selection._search.GridSearchCV
In [67]:
shap_values_a
Out[67]:
[array([[-5.76477868e-03, -2.32884435e-02, -6.61527689e-03, ...,
         -5.01754168e-03,  0.00000000e+00,  7.49099651e-02],
        [-7.72110140e-03, -2.90583936e-02, -3.04146469e-03, ...,
         -1.62416087e-03,  0.00000000e+00, -5.68452118e-02],
        [-7.72110140e-03, -2.36312465e-02, -8.19963297e-03, ...,
         -1.75775342e-03,  0.00000000e+00, -2.22358858e-02],
        ...,
        [ 8.26887424e-03,  6.77228995e-02,  3.70893200e-03, ...,
          1.56349442e-02,  0.00000000e+00, -2.36211556e-02],
        [ 7.70202887e-05, -1.87248363e-02, -8.19963297e-03, ...,
         -5.53378520e-03,  0.00000000e+00,  9.27872387e-03],
        [ 2.04286235e-02, -2.13479386e-02,  4.29016770e-03, ...,
         -1.11661151e-03,  0.00000000e+00, -1.67289743e-02]]),
 array([[ 5.76477868e-03,  2.32884435e-02,  6.61527689e-03, ...,
          5.01754168e-03,  0.00000000e+00, -7.49099651e-02],
        [ 7.72110140e-03,  2.90583936e-02,  3.04146469e-03, ...,
          1.62416087e-03,  0.00000000e+00,  5.68452118e-02],
        [ 7.72110140e-03,  2.36312465e-02,  8.19963297e-03, ...,
          1.75775342e-03,  0.00000000e+00,  2.22358858e-02],
        ...,
        [-8.26887424e-03, -6.77228995e-02, -3.70893200e-03, ...,
         -1.56349442e-02,  0.00000000e+00,  2.36211556e-02],
        [-7.70202887e-05,  1.87248363e-02,  8.19963297e-03, ...,
          5.53378520e-03,  0.00000000e+00, -9.27872387e-03],
        [-2.04286235e-02,  2.13479386e-02, -4.29016770e-03, ...,
          1.11661151e-03,  0.00000000e+00,  1.67289743e-02]])]

The shap_values[0] are explanations with respect to the negative class, while shap_values[1] are explanations with respect to the positive class.

Features in blue pushes the base value towards lowest values and features in red moves base levels towards higher values.

In [68]:
shap.initjs()
shap.force_plot(explainer.expected_value[0],shap_values_a[0][0], dataset_aif_tranf_age.feature_names)
#https://github.com/slundberg/shap
#https://github.com/slundberg/shap/issues/279
Out[68]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [69]:
shap.initjs()
shap.force_plot(explainer.expected_value[1],shap_values_a[1][0], dataset_aif_tranf_age.feature_names)
Out[69]:
Visualization omitted, Javascript library not loaded!
In [70]:
dataset_aif_tranf_age.feature_names
Out[70]:
['Gender',
 'Age',
 'Marital_Status',
 'NumMonths',
 'Savings_<500',
 'Savings_500+',
 'Collateral_unknown/none',
 'Purpose_others',
 'EmployDuration_>=7 yr',
 'Debtors_none',
 'OtherPayBackPlan_bank',
 'Foreignworker',
 'CreditAmount']
In [71]:
shap.force_plot(explainer.expected_value[0],
                shap_values_a[0][:,:], dataset_aif_tranf_age.features[:,:],feature_names = dataset_aif_tranf_age.feature_names)
Out[71]:
Visualization omitted, Javascript library not loaded!
In [72]:
shap.force_plot(explainer.expected_value[1],
                shap_values_a[1][:,:], dataset_aif_tranf_age.features[:,:],feature_names = dataset_aif_tranf_age.feature_names)
Out[72]:
Visualization omitted, Javascript library not loaded!
In [73]:
p = shap.summary_plot(shap_values_a, dataset_aif_tranf_age.features, feature_names=dataset_aif_tranf_age.feature_names) 
display(p)
None

The variables with the highest impact are Age, CreditAmount, NumMonths, and the Savings indicators.

In [74]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values_a[0][0],feature_names=dataset_aif_tranf_age.feature_names)

Interpretation of graph: https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html

The above explanation shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the final model output. Features pushing the prediction higher are shown in red; those pushing the prediction lower are in blue.

f(x): the model output for this instance; E(f(x)): the expected (baseline) output.

One of the fundamental properties of Shapley values is that they always sum to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum to the difference between the baseline (expected) model output and the current model output for the prediction being explained.

In [75]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[1], shap_values_a[1][0],feature_names=dataset_aif_tranf_age.feature_names)

b) Using eli5

In [76]:
#!pip install eli5
from eli5.sklearn import PermutationImportance
In [77]:
perm_age = PermutationImportance(model).fit(dataset_aif_tranf_age.features, dataset_aif_tranf_age.labels.ravel())
In [78]:
perm_imp_1=eli5.show_weights(perm_age,feature_names = dataset_aif_tranf_age.feature_names)
perm_imp_1
plt.show()
Out[78]:
Weight Feature
0.0233 ± 0.0080 NumMonths
0.0192 ± 0.0080 CreditAmount
0.0118 ± 0.0162 Age
0.0106 ± 0.0031 Savings_<500
0.0024 ± 0.0031 Marital_Status
0.0012 ± 0.0033 OtherPayBackPlan_bank
0.0012 ± 0.0020 Gender
0 ± 0.0000 Foreignworker
0 ± 0.0000 Debtors_none
0 ± 0.0000 EmployDuration_>=7 yr
0 ± 0.0000 Purpose_others
-0.0008 ± 0.0033 Collateral_unknown/none
-0.0012 ± 0.0033 Savings_500+

eli5 provides a way to compute feature importances for any black-box estimator by measuring how score decreases when a feature is not available; the method is also known as “permutation importance” or “Mean Decrease Accuracy (MDA)”.

The first number in each row shows how much model performance decreased when that feature was randomly shuffled (in this case, using accuracy as the performance metric).

Like most things in data science, there is some randomness in the exact performance change from shuffling a column. We measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The number after the ± measures how performance varied from one reshuffling to the next.

You'll occasionally see negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than on the real data. This happens when the feature didn't matter (it should have had an importance close to 0), but random chance caused the predictions on shuffled data to be more accurate. This is more common with small datasets, like the one in this example, because there is more room for luck/chance.

https://www.kaggle.com/dansbecker/permutation-importance
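The shuffle-and-score procedure described above is straightforward to implement by hand. A minimal sketch with a hand-rolled `permutation_importance` (my own helper, distinct from `sklearn.inspection.permutation_importance` and from eli5's implementation), on toy data rather than the reweighed credit set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled, over n_repeats."""
    rng = np.random.default_rng(seed)
    base = (model.predict(X) == y).mean()
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])            # break the feature-target link
            scores.append((model.predict(Xp) == y).mean())
        drops[j] = base - np.mean(scores)
    return drops

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)                # only column 0 is informative
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
imp = permutation_importance(clf, X, y)
print(imp.argmax())                          # the informative feature drops most
```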

b) with gender as protected variable in the dataset

In [79]:
mdl_gender = model.fit(dataset_aif_tranf_gender.features, dataset_aif_tranf_gender.labels.ravel())
In [80]:
rf_shap_values = shap.KernelExplainer(grid_search.predict,dataset_aif_tranf_gender.features)
WARNING:shap:Using 700 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
In [81]:
importances = model.best_estimator_.feature_importances_
indices = np.argsort(importances)
features = dataset_aif_tranf_gender.feature_names
#https://stackoverflow.com/questions/48377296/get-feature-importance-from-gridsearchcv
In [82]:
importances

plt.figure(figsize=(20,30))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Out[82]:
array([0.03054944, 0.14122543, 0.04050242, 0.21946099, 0.15563724,
       0.01814715, 0.0132588 , 0.00077236, 0.02369723, 0.        ,
       0.03601172, 0.        , 0.32073723])
Model Explainability

a) Using SHAP

In [83]:
mdl_gender.best_params_
type(model)
explainer = shap.TreeExplainer(grid_search.best_estimator_)
shap_values_b=explainer.shap_values(dataset_aif_tranf_gender.features, dataset_aif_tranf_gender.labels.ravel())
#https://github.com/slundberg/shap/issues/968
Out[83]:
{'max_depth': 3, 'max_features': 7, 'n_estimators': 10}
Out[83]:
sklearn.model_selection._search.GridSearchCV
In [84]:
shap_values_b
Out[84]:
[array([[-5.76477868e-03, -2.32884435e-02, -6.61527689e-03, ...,
         -5.01754168e-03,  0.00000000e+00,  7.49099651e-02],
        [-7.72110140e-03, -2.90583936e-02, -3.04146469e-03, ...,
         -1.62416087e-03,  0.00000000e+00, -5.68452118e-02],
        [-7.72110140e-03, -2.36312465e-02, -8.19963297e-03, ...,
         -1.75775342e-03,  0.00000000e+00, -2.22358858e-02],
        ...,
        [ 8.26887424e-03,  6.77228995e-02,  3.70893200e-03, ...,
          1.56349442e-02,  0.00000000e+00, -2.36211556e-02],
        [ 7.70202887e-05, -1.87248363e-02, -8.19963297e-03, ...,
         -5.53378520e-03,  0.00000000e+00,  9.27872387e-03],
        [ 2.04286235e-02, -2.13479386e-02,  4.29016770e-03, ...,
         -1.11661151e-03,  0.00000000e+00, -1.67289743e-02]]),
 array([[ 5.76477868e-03,  2.32884435e-02,  6.61527689e-03, ...,
          5.01754168e-03,  0.00000000e+00, -7.49099651e-02],
        [ 7.72110140e-03,  2.90583936e-02,  3.04146469e-03, ...,
          1.62416087e-03,  0.00000000e+00,  5.68452118e-02],
        [ 7.72110140e-03,  2.36312465e-02,  8.19963297e-03, ...,
          1.75775342e-03,  0.00000000e+00,  2.22358858e-02],
        ...,
        [-8.26887424e-03, -6.77228995e-02, -3.70893200e-03, ...,
         -1.56349442e-02,  0.00000000e+00,  2.36211556e-02],
        [-7.70202887e-05,  1.87248363e-02,  8.19963297e-03, ...,
          5.53378520e-03,  0.00000000e+00, -9.27872387e-03],
        [-2.04286235e-02,  2.13479386e-02, -4.29016770e-03, ...,
          1.11661151e-03,  0.00000000e+00,  1.67289743e-02]])]
In [85]:
shap.initjs()
shap.force_plot(explainer.expected_value[0],shap_values_b[0][0], dataset_aif_tranf_gender.feature_names)
#https://github.com/slundberg/shap
#https://github.com/slundberg/shap/issues/279
Out[85]:
Visualization omitted, Javascript library not loaded!

The shap_values[0] are explanations with respect to the negative class, while shap_values[1] are explanations with respect to the positive class.

In [86]:
shap.initjs()
shap.force_plot(explainer.expected_value[1],shap_values_b[1][0], dataset_aif_tranf_gender.feature_names)
Out[86]:
Visualization omitted, Javascript library not loaded!
In [87]:
dataset_aif_tranf_gender.feature_names
Out[87]:
['Gender',
 'Age',
 'Marital_Status',
 'NumMonths',
 'Savings_<500',
 'Savings_500+',
 'Collateral_unknown/none',
 'Purpose_others',
 'EmployDuration_>=7 yr',
 'Debtors_none',
 'OtherPayBackPlan_bank',
 'Foreignworker',
 'CreditAmount']
In [88]:
shap.force_plot(explainer.expected_value[0],
                shap_values_b[0][:,:], dataset_aif_tranf_gender.features[:,:],feature_names = dataset_aif_tranf_gender.feature_names)
Out[88]:
Visualization omitted, Javascript library not loaded!
In [89]:
shap.force_plot(explainer.expected_value[1],
                shap_values_b[1][:,:], dataset_aif_tranf_gender.features[:,:],feature_names = dataset_aif_tranf_gender.feature_names)
Out[89]:
Visualization omitted, Javascript library not loaded!
In [90]:
p = shap.summary_plot(shap_values_b, dataset_aif_tranf_gender.features, feature_names=dataset_aif_tranf_gender.feature_names) 
display(p)
None
In [91]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values_b[0][0],feature_names=dataset_aif_tranf_gender.feature_names)

Interpretation of graph: https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html

f(x): the model output for this instance; E(f(x)): the expected (baseline) output.

The above explanation shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the final model output. Features pushing the prediction higher are shown in red; those pushing the prediction lower are in blue.

One of the fundamental properties of Shapley values is that they always sum to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum to the difference between the baseline (expected) model output and the current model output for the prediction being explained.

In [92]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[1], shap_values_b[1][0],feature_names=dataset_aif_tranf_gender.feature_names)

b) Using eli5

In [93]:
#!pip install eli5
from eli5.sklearn import PermutationImportance
In [94]:
perm_gender = PermutationImportance(model).fit(dataset_aif_tranf_gender.features, dataset_aif_tranf_gender.labels.ravel())
In [95]:
perm_imp_2=eli5.show_weights(perm_gender,feature_names = dataset_aif_tranf_gender.feature_names)
perm_imp_2
plt.show()
Out[95]:
Weight Feature
0.0200 ± 0.0094 NumMonths
0.0155 ± 0.0084 CreditAmount
0.0127 ± 0.0031 Savings_<500
0.0090 ± 0.0099 Age
0.0020 ± 0.0000 Marital_Status
0.0012 ± 0.0033 OtherPayBackPlan_bank
0.0012 ± 0.0020 Gender
0 ± 0.0000 Foreignworker
0 ± 0.0000 Debtors_none
0 ± 0.0000 EmployDuration_>=7 yr
0 ± 0.0000 Purpose_others
-0.0004 ± 0.0031 Collateral_unknown/none
-0.0012 ± 0.0020 Savings_500+

eli5 provides a way to compute feature importances for any black-box estimator by measuring how score decreases when a feature is not available; the method is also known as “permutation importance” or “Mean Decrease Accuracy (MDA)”.


c) with marital status as protected variable in the dataset

In [96]:
mdl_marital = model.fit(dataset_aif_tranf_marital.features, dataset_aif_tranf_marital.labels.ravel())
In [97]:
rf_shap_values = shap.KernelExplainer(grid_search.predict,dataset_aif_tranf_marital.features)
WARNING:shap:Using 700 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
In [98]:
importances = model.best_estimator_.feature_importances_
indices = np.argsort(importances)
features = dataset_aif_tranf_marital.feature_names
#https://stackoverflow.com/questions/48377296/get-feature-importance-from-gridsearchcv
In [99]:
importances

plt.figure(figsize=(20,30))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Out[99]:
array([0.03054944, 0.14122543, 0.04050242, 0.21946099, 0.15563724,
       0.01814715, 0.0132588 , 0.00077236, 0.02369723, 0.        ,
       0.03601172, 0.        , 0.32073723])
Model Explainability

a) Using SHAP

In [100]:
mdl_marital.best_params_
type(model)
explainer = shap.TreeExplainer(grid_search.best_estimator_)
shap_values_c=explainer.shap_values(dataset_aif_tranf_marital.features, dataset_aif_tranf_marital.labels.ravel())
#https://github.com/slundberg/shap/issues/968
Out[100]:
{'max_depth': 3, 'max_features': 7, 'n_estimators': 10}
Out[100]:
sklearn.model_selection._search.GridSearchCV
In [101]:
shap_values_c
Out[101]:
[array([[-5.76477868e-03, -2.32884435e-02, -6.61527689e-03, ...,
         -5.01754168e-03,  0.00000000e+00,  7.49099651e-02],
        [-7.72110140e-03, -2.90583936e-02, -3.04146469e-03, ...,
         -1.62416087e-03,  0.00000000e+00, -5.68452118e-02],
        [-7.72110140e-03, -2.36312465e-02, -8.19963297e-03, ...,
         -1.75775342e-03,  0.00000000e+00, -2.22358858e-02],
        ...,
        [ 8.26887424e-03,  6.77228995e-02,  3.70893200e-03, ...,
          1.56349442e-02,  0.00000000e+00, -2.36211556e-02],
        [ 7.70202887e-05, -1.87248363e-02, -8.19963297e-03, ...,
         -5.53378520e-03,  0.00000000e+00,  9.27872387e-03],
        [ 2.04286235e-02, -2.13479386e-02,  4.29016770e-03, ...,
         -1.11661151e-03,  0.00000000e+00, -1.67289743e-02]]),
 array([[ 5.76477868e-03,  2.32884435e-02,  6.61527689e-03, ...,
          5.01754168e-03,  0.00000000e+00, -7.49099651e-02],
        [ 7.72110140e-03,  2.90583936e-02,  3.04146469e-03, ...,
          1.62416087e-03,  0.00000000e+00,  5.68452118e-02],
        [ 7.72110140e-03,  2.36312465e-02,  8.19963297e-03, ...,
          1.75775342e-03,  0.00000000e+00,  2.22358858e-02],
        ...,
        [-8.26887424e-03, -6.77228995e-02, -3.70893200e-03, ...,
         -1.56349442e-02,  0.00000000e+00,  2.36211556e-02],
        [-7.70202887e-05,  1.87248363e-02,  8.19963297e-03, ...,
          5.53378520e-03,  0.00000000e+00, -9.27872387e-03],
        [-2.04286235e-02,  2.13479386e-02, -4.29016770e-03, ...,
          1.11661151e-03,  0.00000000e+00,  1.67289743e-02]])]
In [102]:
shap.initjs()
shap.force_plot(explainer.expected_value[0],shap_values_c[0][0], dataset_aif_tranf_marital.feature_names)
#https://github.com/slundberg/shap
#https://github.com/slundberg/shap/issues/279
Out[102]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

The shap_values[0] are explanations with respect to the negative class, while shap_values[1] are explanations with respect to the positive class.
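For a binary classifier, the two arrays are mirror images: the class probabilities sum to one, so each feature's contribution toward one class is the exact negative of its contribution toward the other. A tiny numpy sketch (with made-up values, not the actual `shap_values_c`) illustrates this:

```python
import numpy as np

# Made-up SHAP values for one row, one entry per feature (illustrative only).
phi_class0 = np.array([-0.0058, -0.0233, -0.0066, 0.0749])

# Because p0 + p1 = 1 for every input, contributions mirror exactly:
phi_class1 = -phi_class0

assert np.allclose(phi_class0 + phi_class1, 0.0)
```

This is why the two force plots above look like reflections of each other across the base value.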

In [103]:
shap.initjs()
shap.force_plot(explainer.expected_value[1],shap_values_c[1][0], dataset_aif_tranf_marital.feature_names)
Out[103]:
In [104]:
dataset_aif_tranf_marital.feature_names
Out[104]:
['Gender',
 'Age',
 'Marital_Status',
 'NumMonths',
 'Savings_<500',
 'Savings_500+',
 'Collateral_unknown/none',
 'Purpose_others',
 'EmployDuration_>=7 yr',
 'Debtors_none',
 'OtherPayBackPlan_bank',
 'Foreignworker',
 'CreditAmount']
In [105]:
shap.force_plot(explainer.expected_value[0],
                shap_values_c[0][:,:], dataset_aif_tranf_marital.features[:,:],feature_names = dataset_aif_tranf_marital.feature_names)
Out[105]:
In [106]:
shap.force_plot(explainer.expected_value[1],
                shap_values_c[1][:,:], dataset_aif_tranf_marital.features[:,:],feature_names = dataset_aif_tranf_marital.feature_names)
Out[106]:
In [107]:
# summary_plot draws directly and returns None, so no need to capture/display its return value
shap.summary_plot(shap_values_c, dataset_aif_tranf_marital.features, feature_names=dataset_aif_tranf_marital.feature_names)
In [108]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values_c[0][0],feature_names=dataset_aif_tranf_marital.feature_names)

Interpretation of graph: https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html

f(x) is the model output for the instance being explained; E[f(x)] is the expected (base) model output.

The explanation above shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the final prediction. Features pushing the prediction higher are shown in red; those pushing the prediction lower are shown in blue.

One of the fundamental properties of Shapley values is that they always sum to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all input features always sum to the difference between the baseline (expected) model output and the model output for the prediction being explained.
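This additivity property can be checked directly with assumed numbers (the base value and per-feature values below are illustrative, not taken from the explainer):

```python
import numpy as np

# Assumed values for illustration: base value E[f(x)] and one row's SHAP values.
base_value = 0.30
phi = np.array([0.12, -0.05, 0.08, 0.02])

# Additivity: the SHAP values sum to f(x) - E[f(x)], so the base value
# plus the per-feature contributions recovers the model output.
f_x = base_value + phi.sum()
assert np.isclose(f_x, 0.47)
```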

In [109]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[1], shap_values_c[1][0],feature_names=dataset_aif_tranf_marital.feature_names)

b) Using eli5

In [110]:
#!pip install eli5
from eli5.sklearn import PermutationImportance
In [111]:
perm_marital = PermutationImportance(model).fit(dataset_aif_tranf_marital.features, dataset_aif_tranf_marital.labels.ravel())
In [112]:
perm_imp_3=eli5.show_weights(perm_marital,feature_names = dataset_aif_tranf_marital.feature_names)
perm_imp_3
plt.show()
Out[112]:
Weight Feature
0.0220 ± 0.0060 NumMonths
0.0151 ± 0.0095 CreditAmount
0.0118 ± 0.0060 Age
0.0090 ± 0.0055 Savings_<500
0.0024 ± 0.0031 Marital_Status
0.0016 ± 0.0031 OtherPayBackPlan_bank
0.0016 ± 0.0031 Gender
0 ± 0.0000 Foreignworker
0 ± 0.0000 Debtors_none
0 ± 0.0000 EmployDuration_>=7 yr
0 ± 0.0000 Purpose_others
0 ± 0.0000 Savings_500+
-0.0016 ± 0.0016 Collateral_unknown/none

eli5 provides a way to compute feature importances for any black-box estimator by measuring how the score decreases when a feature is not available; this method is also known as "permutation importance" or "Mean Decrease Accuracy" (MDA).
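What eli5's PermutationImportance automates can be sketched by hand: shuffle one column at a time and record how much the score drops. A minimal version on synthetic data (a stand-in, not the German credit features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the notebook uses the AIF360-transformed features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
baseline = clf.score(X, y)

rng = np.random.RandomState(0)
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])            # destroy the feature-target link for column j
    importances.append(baseline - clf.score(X_perm, y))

print([round(v, 3) for v in importances])  # larger drop = more important feature
```

eli5 additionally repeats the shuffle several times per feature, which is where the "± uncertainty" columns in the tables above come from.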

XGBOOST

In [113]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

estimator = XGBClassifier(seed=40)

parameters = {
    'max_depth': range(2, 10, 2),
    'n_estimators': range(60, 240, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}
grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring='recall',
    cv=5,
    verbose=0
)

model = grid_search
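To sanity-check the grid-search wiring without xgboost installed, the same GridSearchCV call can be exercised with a scikit-learn stand-in estimator (the RandomForest and the small grid below are assumptions, not the notebook's model):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Mirror the cell above: 'recall' scoring and 5-fold CV, on synthetic data.
X, y = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'max_depth': [2, 4], 'n_estimators': [10, 20]},
    scoring='recall',
    cv=5,
)
search.fit(X, y)
print(search.best_params_)       # dict of the winning hyperparameters
```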
In [114]:
#rf_shap_values = shap.KernelExplainer(grid_search.predict,dataset_aif_tranf_age.features)

a) age as protected variable

In [115]:
mdl_age = model.fit(dataset_aif_tranf_age.features, dataset_aif_tranf_age.labels.ravel())
In [116]:
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.TreeExplainer(grid_search.best_estimator_,dataset_aif_tranf_age.features)
shap_values=explainer.shap_values(dataset_aif_tranf_age.features, dataset_aif_tranf_age.labels.ravel())
#https://github.com/slundberg/shap
In [117]:
shap_values
Out[117]:
array([[ 0.        ,  0.11156371,  0.00077599, ...,  0.00833379,
         0.        , -0.10627968],
       [ 0.        ,  0.07316086,  0.00077599, ...,  0.00833379,
         0.        ,  0.04096031],
       [ 0.        ,  0.06495592,  0.00077599, ...,  0.00833379,
         0.        ,  0.02756702],
       ...,
       [ 0.        , -0.18114036, -0.00346608, ..., -0.15743269,
         0.        ,  0.0271022 ],
       [ 0.        ,  0.0735417 ,  0.00315569, ...,  0.03174634,
         0.        ,  0.02756702],
       [ 0.        ,  0.07915238, -0.00067252, ...,  0.00833379,
         0.        ,  0.04096031]])
In [118]:
shap.initjs()
shap.force_plot(explainer.expected_value,shap_values[0,:], dataset_aif_tranf_age.feature_names)
Out[118]:
In [119]:
shap.initjs()
shap.force_plot(explainer.expected_value,shap_values[1,:], dataset_aif_tranf_age.feature_names)
Out[119]:
In [120]:
shap.force_plot(explainer.expected_value, shap_values[:,:], dataset_aif_tranf_age.features[:,:], feature_names = dataset_aif_tranf_age.feature_names)
Out[120]:
In [121]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value, shap_values[0,:],feature_names=dataset_aif_tranf_age.feature_names)
In [122]:
perm_age = PermutationImportance(model).fit(dataset_aif_tranf_age.features, dataset_aif_tranf_age.labels.ravel())
perm_imp=eli5.show_weights(perm_age,feature_names = dataset_aif_tranf_age.feature_names)
perm_imp
plt.show()
Out[122]:
Weight Feature
0.0224 ± 0.0026 NumMonths
0.0212 ± 0.0055 Savings_<500
0.0073 ± 0.0092 Age
0.0004 ± 0.0016 OtherPayBackPlan_bank
0 ± 0.0000 Foreignworker
0 ± 0.0000 Debtors_none
0 ± 0.0000 EmployDuration_>=7 yr
0 ± 0.0000 Purpose_others
0 ± 0.0000 Collateral_unknown/none
0 ± 0.0000 Savings_500+
0 ± 0.0000 Marital_Status
0 ± 0.0000 Gender
-0.0069 ± 0.0020 CreditAmount

b) gender as protected variable

In [123]:
mdl_gender = model.fit(dataset_aif_tranf_gender.features, dataset_aif_tranf_gender.labels.ravel())
In [124]:
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.TreeExplainer(grid_search.best_estimator_,dataset_aif_tranf_gender.features)
shap_values=explainer.shap_values(dataset_aif_tranf_gender.features, dataset_aif_tranf_gender.labels.ravel())
#https://github.com/slundberg/shap
In [125]:
shap_values
Out[125]:
array([[ 0.        ,  0.11156371,  0.00077599, ...,  0.00833379,
         0.        , -0.10627968],
       [ 0.        ,  0.07316086,  0.00077599, ...,  0.00833379,
         0.        ,  0.04096031],
       [ 0.        ,  0.06495592,  0.00077599, ...,  0.00833379,
         0.        ,  0.02756702],
       ...,
       [ 0.        , -0.18114036, -0.00346608, ..., -0.15743269,
         0.        ,  0.0271022 ],
       [ 0.        ,  0.0735417 ,  0.00315569, ...,  0.03174634,
         0.        ,  0.02756702],
       [ 0.        ,  0.07915238, -0.00067252, ...,  0.00833379,
         0.        ,  0.04096031]])
In [126]:
shap.initjs()
shap.force_plot(explainer.expected_value,shap_values[0,:], dataset_aif_tranf_gender.feature_names)
Out[126]:
In [127]:
shap.initjs()
shap.force_plot(explainer.expected_value,shap_values[1,:], dataset_aif_tranf_gender.feature_names)
Out[127]:
In [128]:
shap.force_plot(explainer.expected_value, shap_values[:,:], dataset_aif_tranf_gender.features[:,:], feature_names = dataset_aif_tranf_gender.feature_names)
Out[128]:
In [129]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value, shap_values[0,:],feature_names=dataset_aif_tranf_gender.feature_names)
In [130]:
perm_gender = PermutationImportance(model).fit(dataset_aif_tranf_gender.features, dataset_aif_tranf_gender.labels.ravel())
perm_imp=eli5.show_weights(perm_gender,feature_names = dataset_aif_tranf_gender.feature_names)
perm_imp
plt.show()
Out[130]:
Weight Feature
0.0200 ± 0.0054 Savings_<500
0.0171 ± 0.0099 NumMonths
0.0086 ± 0.0075 Age
0 ± 0.0000 Foreignworker
0 ± 0.0000 OtherPayBackPlan_bank
0 ± 0.0000 Debtors_none
0 ± 0.0000 EmployDuration_>=7 yr
0 ± 0.0000 Purpose_others
0 ± 0.0000 Collateral_unknown/none
0 ± 0.0000 Savings_500+
0 ± 0.0000 Marital_Status
0 ± 0.0000 Gender
-0.0061 ± 0.0026 CreditAmount

c) marital status as protected variable

In [131]:
mdl_marital = model.fit(dataset_aif_tranf_marital.features, dataset_aif_tranf_marital.labels.ravel())
In [132]:
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.TreeExplainer(grid_search.best_estimator_,dataset_aif_tranf_marital.features)
shap_values=explainer.shap_values(dataset_aif_tranf_marital.features, dataset_aif_tranf_marital.labels.ravel())
#https://github.com/slundberg/shap
In [133]:
shap_values
Out[133]:
array([[ 0.        ,  0.11156371,  0.00077599, ...,  0.00833379,
         0.        , -0.10627968],
       [ 0.        ,  0.07316086,  0.00077599, ...,  0.00833379,
         0.        ,  0.04096031],
       [ 0.        ,  0.06495592,  0.00077599, ...,  0.00833379,
         0.        ,  0.02756702],
       ...,
       [ 0.        , -0.18114036, -0.00346608, ..., -0.15743269,
         0.        ,  0.0271022 ],
       [ 0.        ,  0.0735417 ,  0.00315569, ...,  0.03174634,
         0.        ,  0.02756702],
       [ 0.        ,  0.07915238, -0.00067252, ...,  0.00833379,
         0.        ,  0.04096031]])
In [134]:
shap.initjs()
shap.force_plot(explainer.expected_value,shap_values[0,:], dataset_aif_tranf_marital.feature_names)
Out[134]:
In [135]:
shap.initjs()
shap.force_plot(explainer.expected_value,shap_values[1,:], dataset_aif_tranf_marital.feature_names)
Out[135]:
In [136]:
shap.force_plot(explainer.expected_value, shap_values[:,:], dataset_aif_tranf_marital.features[:,:], feature_names = dataset_aif_tranf_marital.feature_names)
Out[136]:
In [137]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value, shap_values[0,:],feature_names=dataset_aif_tranf_marital.feature_names)
In [138]:
perm_marital = PermutationImportance(model).fit(dataset_aif_tranf_marital.features, dataset_aif_tranf_marital.labels.ravel())
perm_imp=eli5.show_weights(perm_marital,feature_names = dataset_aif_tranf_marital.feature_names)
perm_imp
plt.show()
Out[138]:
Weight Feature
0.0176 ± 0.0055 Savings_<500
0.0106 ± 0.0098 NumMonths
0.0102 ± 0.0082 Age
0 ± 0.0000 Foreignworker
0 ± 0.0000 OtherPayBackPlan_bank
0 ± 0.0000 Debtors_none
0 ± 0.0000 EmployDuration_>=7 yr
0 ± 0.0000 Purpose_others
0 ± 0.0000 Collateral_unknown/none
0 ± 0.0000 Savings_500+
0 ± 0.0000 Marital_Status
0 ± 0.0000 Gender
-0.0041 ± 0.0026 CreditAmount